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^ I Abstract 



In the frequentist program, inferential methods with exact control on error rates 
C^ I are a primary focus. Methods based on asymptotic distribution theory may not be 

suitable in a particular problem, in which case, a numerical method is needed. 
This paper presents a general, Monte Carlo-driven framework for the construction 
of frequentist procedures based on plausibility functions. It is proved that the suit- 
ably defined plausibility function-based tests and confidence regions have desired 
frequentist properties. Moreover, in an important special case involving likelihood 
ratios, conditions are given such that the plausibility function behaves asymptoti- 
cally like a consistent Bayesian posterior distribution. An extension of the proposed 
method is also given for the case where nuisance parameters are present. A number 
Cn . of examples are given to illustrate the method's flexibility and to demonstrate its 

ly-s I strong performance compared to existing numerical and analytical methods. 

^O I Keywords and phrases: Bootstrap; confidence region; likelihood; Monte Carlo; 

vQ I profile likelihood. 
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(N 1 Introduction 

In the Neyman-Pearson program, construction of statistical procedures, such as con- 
S^ I fidence intervals or hypothesis tests, having control over frequentist error rates is an 

important problem. But despite this importance, the construction of such procedures 
remains relatively ad hoc. Indeed, one would typically cook up some summary statistic 
and derive a procedure based on the statistic's exact or asymptotic sampling distribution. 
For example, to construct a confidence region, one might invoke the asymptotic normal- 
ity of the maximum likelihood estimate. These first-order methods are the most widely 
used in p ractice, but procedures with highe r-order accuracy are also available; see, for 



example, iReidI (120031 ). iBrazzale et al.l (120071 ). and the references therein. However, the 



challenging analytical calculations often needed to implement these methods has led to 
their relatively slow acceptance in applied work. 

In cases where an analytical solution is not available, numerical methods are needed. 



Perhaps the most popular numerical method is the bootstrap (e.g.. lDavison and Hinkley 



19971 : lEfron and Tibshiranilll993f ). The bootstrap is beautifully simple and, perhaps be- 
cause of its simplicity, has made a tremendous impact on statistics; cf. the special 2003 
issue of Statistical Science (Volume 18, Issue 2). However, the theoretical basis for the 
bootstrap is also asymptotic, the idea being that the bootstrap distribution will approxi- 
mate the true data-generating distribution, which can only be justified in large-samples. 
Moreover, the bootstrap cannot be used bli ndly as a black- box type of procedure, for 
there are cases where the bootstrap fails (e.g. lDasGuDtall2008l, Chap. 29.7) and the reme- 
dies for bootstrap failure are somewhat counterintuitive (e.g.. lBickel et al.lll997l ). In light 
of the bootstrap's subtleties, it seems desirable to search for a numerical method able to 
produce exact frequentist procedures for general inference problems. 

Numerical methods, particularly M arkov chain Monte C a rlo, have made a significan t 



impact in Bayesiaii statistics; see, e.g.. lGelman et al.l (J2004f ). iRobert and Casellal (J2004f ) 



Ghosh et al.l (120061 ) . and the references therein. But Monte Carlo methods have yet to 
have the same kind of impact in the frequentist program. In this paper, I propose a new, 
simple, unified, Monte Carlo-driven approach to the construction of frequentist inference 
procedures. In particular, if 6* G is the parameter of interest, I construct a plausibility 
function p\y{A) that measures the plausibility of the claim "6' G A" after seeing data 
y. The precise definition of ply(-) is given in Section [2.H but it starts with a choice of 
^-dependent scalar- valued function Ty^g of y, such as the likelihood ratio. Then ply(A) 
arises from a probability calculation involving random sets associated with Ty^g and the 
set A C O. It is rare that these probabilities can be calculated in closed- form, so Monte 
Carlo integration is a critical part of the proposed method. 

Details on how the plausibility function can be used for inference are given in Sec- 
tion 1221 but the main idea is that the assertion/hypothesis A is doubtful whenever pl2^(A) 
is sufficiently small. I show that a decision rule which rejects the hypothesis "^ G A" 
if and only if ply(A) < a controls the frequentist Type I error at level a. This, as 
usual, can be turned around to construct plausibility regions for 6 which have nominal 
control on the frequentist coverage probability. These results are stated formally in Sec- 
tion 12.4. 1[ To investigate large-sample consistency of the proposed method, I treat the 
plausibility function the same way a Bayesian might treat a posterior probability in an 
asymptotic framework. That is, I show that, under certain conditions, for any assertion 
A outside a neighborhood of the true parameter value, the plausibility function evaluated 
at A vanishes almost surely. Th is parallels a Bayesian's notion of posterior consistency 
fJGhosh and Ramamoorthill2003l . Sec. 1.3), and implies that decisions made with plausi- 
bility functions are correct with probability arbitrarily close to 1, for n sufficiently large. 

In many problems, the parameter of interest is some lower- dimensional function of 
the full parameter 6; that is, 6 = (V',A), where ip is the parameter of interest and A 
is a nuisance parameter. For the challenging problem of inference on ip in the presence 
of unknown A, I propose a marginal plausibility function in Section [31 In this context, 
it is natural to replace Tyg with a similar quantity Ty^^ that does not depend directly 
on the nuisance parameter A. I argue that if Ty^^i, is A-ancillary, i.e., the distribution 
of Ty^^i, as a function of Y is free of A, then the basic theory in Section 12.41 carries over 
directly to the more general case. What can be done when Ty^^p is not exactly A-ancillary 
is also discussed. Examples are given for illustration, and applications to inference in 
random-effects models and variable selection is regression are also presented. 

Throughout I focus primarily on the construction of confidence regions, with the ex- 



ception of Section I3.4.2I The hypothesis testing problem is essentially the same, but 
computationally harder in general. Examples are given in Sections I2.5[ 13. 3[ and 13.41 to 
illustrate the proposed method. The results are compared to those based on existing 
methods, such as likelihood theory and bootstrap. The general message is that the pro- 
posed plausibility function-based inference is as good or better than existing methods in 
a variety of problems. In a simple but practically important Gaussian random effects 
model (Section l3. 4. ip . it is shown numerically that the proposed method provides exact 
inference, while the standard parametric bootstrap fails. R codes for all the examples in 
this paper are available at my website (www.mat h.uic. ed.u/~rgmart in) . Remarks con- 
cerning the interpretation of the plausibility function are deferred to the special Section m 
A brief discussion appears in Section 13 proofs of theorems are given in Appendix |X1 

2 Plausibility functions 

2.1 Construction 

Let y be a sample from distribution Pg on Y, where 6 is an unknown parameter taking 
values in 0, a separable metric space; here Y could be, say, a sample of size n from a 
product measure Pg, but I have suppressed the dependence on n in the notation. 

To start, choose a real- valued function Ty^g of the observed data Y = y and the 
unknown parameter 6. In this paper I shall focus on the relative likelihood 

Ty,g=Lyie)/Lyie), (l) 

where ^ is a maximizer of the likelihood function Ly{9). Other choices of Ty^g are possible 
(see Remark [1] in Section Hj) but the use of likelihood is reasonable since it conveniently 
summarizes all information in y concerning 9. There is also a computational advantage 
in that ([T]) depends on y only through sufficient statistics; see Section 12.31 Let Fg be the 
distribution function of Ty^ when Y ~ Pg, i.e., 

Fg{t) = Pg{TY,g < t), teR. (2) 

Often Fg(t) will be smooth function of t, but the discontinuous case is also possible. To 
avoid measurability issues, I shall silently assume throughout that Fg{t) is a continuous 
function in 6 for each t. 

The goal here is to present a new and unified approach to the frequentist inference 
problem. Towards this goal, I shall first take a somewhat unexpected turn. Consider a 
data-dependent collection of subsets of O: 

Oyiu) = {9 : Fg{Ty^g) >u}, ue (0, 1). (3) 

The sets Qy{u) are clearly nested, i.e., for fixed y, if u' > u, then Qy{u') C Qy{u). In fact, 
for Tyg in ([T]), Qy{u) is similar to the usual likelihood level set used to construct confidence 
regions; the difference is the appearance of Fg, but see ([5]). In general, one intuitively 
expects that the true 9 is likely to reside in Qy{u) when u is not too large. My next step 
is to introduce an additional source of randomness with y fixed. Specifically, consider 
the random set Qy{U) where U ~ Unif(0, 1). The choice of a uniform distribution for U 



is (i) consistent with the classical use of Qy{-) for confidence regions, and (ii) necessary 
for the frequentist properties to be derived in Section r2.4.1[ Now, for any subset A C 0, 
consider the probability pl„(A) that the random set Qy{U) touches A: 



p\M) 



I{Qy{u) gLA'}du 



suY>Fe{Ty 



(4) 



Continuity of Fg in 6 and separability of 6 ensure that plj^(A) is a measurable function in 
y for each A. The function p\y{-), defined on 2®, is called a plausibility function. When 
A = {9} is a singleton set, I shall write p\y{9) instead of p\y{{6}). Intuitively, plj^(v4) 
measures the plausibility of the claim that A contains the true value of 6, denoted by 6*, 
given observation Y = y. To better understand this intuition, observe that Qy{U) will, 
for typical U, likely contain the true 9*. So, large plj^(A) indicates that a typical Qy{U) 
simultaneously contains 6* and points in A] therefore, the case 6* & A is plausible in the 
sense that it cannot be ruled out. 

Besides as a tool for constructing frequentist procedures (see Section [X^ . the plausi- 
bility function has some potentially deeper applications. Indeed, plj^(^) can be interpreted 
and used "subjectively," much like p-values and Bayesian posterior probabilities. More- 
over, there are some conne ctions to sey eral alternative inferential fram eworks, includin g 



Fisher's fiducial approach (JFisherlll973l ). D empster-Shafer theory (e.g.. lDempsterll2008l ). 
and the new inferential model framework ( Martin and Liull2012l ). See Remarks [2] and E] 
in Section H] for more discussion along these lines. 



2.2 Use in statistical inference 

The plausibility function can be used for a variety of statistical inference problems. First, 
consider the point estimation problem. Perhaps the most natural plausibility function- 
based point estimator is the maximizer 6"^ of p\y{0). If Ty^g is the relative likelihood ([T]), 

then 6'^ = 6, and the desirable properties of 6 are inherited by 6^ . But if Tyg is something 
other than the relative likelihood (see Remark [1]), then a different plausibility-based 
estimator can be obtained. 

The plausibility function p\y{9) can also be used to construct confidence regions. This 
will be my primary focus throughout the paper. Specifically, for any a G (0, 1), define 
the 100(1 — a)% plausibility region 



Uy{a) = {9 : p\y{9) > a}. 



(5) 



The intuition is that 9 values which are sufficiently "plausible," given Y = y, are good 
guesses for the true parameter value. When Ty ^ is th e relative likelihood in ([1]), the plau- 
sibility region is equivalent to that in lSpj0tvolll (119721 ). Two important related properties 
i) since p\y{9'^) = 1, the maximum plausibility estimator 9"^ always resides in 



oiIly{a) are: 



n„ 



.a 



so the plausibility region is never empty; (ii) 



Ily{a) 



respects non-trivial constraints 
These two 



in 9, i.e., Ily{a) contains no 9 values outside the effective parameter space 
properties motivated the "unified approach" develo ped by iFeldman and CousinsI (119981 ) 
and further studied by lZhang and Woodroofd (120021 ) — in fact, these are important special 
cases of the method presented here. 



Next consider a hypothesis testing problem, Hq : 6 E Qq versus Hi : 6 ^ Qq. Define a 
plausibihty function-based test as follows: 

reject Hq if and only if plj^(9o) < a. (6) 

The intuition is that if Go is not sufficiently "plausible," given Y = y, then one should 
conclude that the true 6 is outside Gq. Further understanding of ([6]) follows from the 
fact that pi (Go) is a type of p-value for the null hypothesis Gq. In Section [2^ I show 
that this plausibility function-based test has desirable frequentist properties; the well- 
known connection between confidence regions and hypothesis tests shows that these same 
desirable properties carry over to the plausibility regions described previously. 

2.3 Implementation 

Evaluation of FolTyg) is crucial to the proposed methodology. In select problems, it may 
be possible to derive this distribution exactly, but this generally will not be the case. Here 
I present a simple Monte Carlo approximation of F^. See Remarks H] and [5] in Section H] 
for a discussion of the advantages and limitations of this approach compare with other 
methods, including the parametric bootstrap. 

To estimate Fg{Ty^g), where Y = y is the observed sample, first choose a large number 
M; in all the examples presented herein, I (conservatively) use M = 5 x 10^. Then 
construct the following root-M consistent estimate of Fg(Ty^g): 

1 ^ 
Fe{Ty,e) = - J^ /{T^..),, < T,,,}, Y'^'\ ..., F^ '^ P,. (7) 

m=l 

This strategy can be performed for any choice of 6, so we may consider Fg{Ty^0) as a 
function of 6. If necessary, optimization over a set A C G can be carried out using any 
standard routine, such as optim in R. To compute a plausibility interval, solutions to the 
equation Fg(Ty^g) = a are required. T hese can be obtained using , for e xample, standard 



bisection or stochastic approximation fJGarthwaite and Bucklandlll992[ ). 



Note that it may not be necessary to simulate a full data set Y^"^^ at each Monte 
Carlo step. Indeed, if Ty^e is a variant of the likelihood, then it is enough to simulate a 
sufficient statistic. When F = Fg does not depend on 6 (see, e.g.. Theorem 12]), then it is 
not necessary to simulate the y(™) values for each value of 9, i.e., the same simulations 
can be used to evaluate F{Ty^g) for each 6. When different simulations are needed for 
each 6, it may be possible to implement an importance weight resampling procedure to 
reduce the time spent on simulations. In general, there is no closed-form expression for 
Ty^g, in which case, it is important to have efficient numerical solutions so that the Monte 
Carlo approximation ([7]) is not too expensive. 

2.4 Sampling distribution properties 

2.4.1 Fixed-n theory 

The first result describes the sampling distribution of ply(A) as a function of F ~ Pg. 
The sample size n is fixed and, hence, omitted from the notation. Fixed-n properties of 
p\y{A) are critical to the advertised exactness of the proposed procedure. 



Theorem 1. Let A be a subset of 0. For any 6 E A, if Y r^ Pq, then ply(y4) is 
stochastically larger than uniform. That is, for any a G (0, 1), 

supPejplyl^) < «} < «. (8) 

The claim that the plausibihty function-based test in (E]) achieves the nominal fre- 
quentist size follows as an immediate corollary of Theorem [T] 



Corollary 1. For any a G (0,1), the size of the test (^ is no more than a. That is, 
supg^Qg Pe{ply(0o) <«}<«• Moreover, if Hq is a point-null, so that Qq is a singleton, 
and Tyfi is a continuous random variable when F ~ P^, then the size is exactly a. 

Proof. Apply Theorem [T] with A = Gq. □ 

As indicated earlier, the case where A is a singleton set is an important special case for 
point estimation and plausibility region construction. The result in Theorem [1] specializes 
nicely in this singleton case. 

Theorem 2. (i) If Tyfi is a continuous random variable as a function ofY r^ Pg, then 
ply(^) is uniformly distributed, (ii) If Ty^ is a discrete random variable when Y ~ Pg, 
then p\y{0) is stochastically larger than uniform. 

The promised result on the frequentist coverage probability of the plausibility region 
Ily{oi) in ([5]) follows as an immediate corollary of Theorem |2] 

Corollary 2. For any a G (0, 1), the plausibility region Ily{a) has the nominal frequen- 
tist coverage probability; that is, Pg{IlY{a) 3 6} > 1 — a. Furthermore, the coverage 
probability is exactly 1 — a under condition (i) of Theorem 

Proof. Observe that P6){ny(a;) 3 0} = Pg{p\y{9) > a}. So according to Theorem [21 
this probability is at least 1 — a. Moreover, under condition (i) of Theorem [21 p\y{0) is 
uniformly distributed, so "at least" can be changed to "equal to." D 

Theorems [Hand [21 and their corollaries, demonstrate that the proposed method is valid 
for any sample size. Therefore, the level of inferential accuracy is completely determined 
by computational resources, e.g., Monte Carlo sample size M, and is unrelated to the 
physical sample size n. So, unlike the bootstrap or other analytical approximations, 
plausibility function-based inference does not require a large-sample justification. 

From Theorem [21 it is clear that the distinction between discrete and continuous Ty^g 
is important. The quantity Ty^g is continuous if, for example, Ty^g is a continuous function 
of y and F ~ Pgi is a continuous random variable. However, this need not be the case 
in general. Indeed, it is possible, even if Y is continuous, that Tyg depends on Y only 
through some finite-valued summary, in which case Ty^g is discrete. 

An interesting question is if, and under what conditions, the distribution function 
Fg does not depend on 6. An answer to this question helps simplify evaluation of the 
plausibility function (see Section [23]); there are also positive consequences regarding the 
efficiency of the plausibility function-based inference. Next I describe a context where 
this ^-independence can be discussed in general. 



Let ^ be a group of transformations (? : Y — ?■ Y, and let ^ be a corresponding group 
of transformations (7 : O — ?■ O defined by the invariance condition: 

if y ~ Pe, then gY ~ P-ge, (9) 

where, e.g., gy denotes the image of y under transformation g. Note that g and g are 
tied together by the relation ([9]). Models that satisfy iQ are called group transformation 
models. Common examples are location- and scale-parameter problems. 

Theorem 3. Suppose ([9]) holds for groups ^ and'^ as described above. If^ is transitive 
on Q and Ty0 is "bi-invariant,", i.e., 

Tgy,ge = Ty fi for ally eY and g E'^, (10) 

then the distribution function Fg in (|2]) does not depend on 9. 

Corollary 3. Suppose the model has dominating measure relatively invariant with respect 
to '^ . Then ([9]) holds and, the relative likelihood Ty^ , in ([T]), satisfies flTOj) . Therefore, if 
^ is transitive, the distribution function Fg in (^^ does not depend on 9. 



Proof. lEatonl fll989l . Theorem 3.1) establishes Q. Moreover, lEatonl fll989l . pp. 46-47) 



argues that, under the stated conditions, the likelihood satisfies Ly{9) = Lgy{g9)x{g), for 
a multiplier x that does not depend on y or 9. This invariance result immediately implies 
(ITU]) for the relative likelihood Ty^e- Now apply Theorem [31 □ 

2.4.2 Asymptotic theory 

The theoretical results in this subsection cover only the case where Ty^g is the relative 
likelihood in ([T]), which is general enough for my purposes here. However, I expect that 
similar results would hold for other choices of Tyg, although special techniques (maximal 
inequalities, etc) would likely be needed. 

There are two ways one can look at asymptotic theory of plausibility functions. The 
first, which is rather informal, is to consider the asymptotic distribution of Ty^g, where 
Y = (Yi, . . . , Yn), before constructing the plausibihty function. For example, under suit- 
able regularity conditions, when n is large, the distribution of — 2 log Ty.e, for Ty^g as in 
([T]), is approximately a chi-square. Therefore, the distribution function Fg{t) oiTyg is ap- 
proximately P{Q > — 21ogt), where Q is an appropriate chi-square random variable. One 
can then proceed to construct a plausibility function based on this approximate distribu- 
tion function for Ty^g. Since the limiting distribution of Ty^g is free of 9, the plausibility 
function-based inference is asymptotically efficient in the sense of Theorem [31 

The second, and more formal approach is to calculate the plausibility function as 
described in Section 12.11 and then study its properties as n — )• 00. In this subsection, 
write pl„(-) instead of ply(-) to emphasize the dependence on n through Y = (Yi, . . . , Yn). 
Similarly, L„ denotes the likelihood Ly and Fn^g the distribution function Fg in ([2]). 
The goal here is to establish a sort of plausi bility function consistency, simi lar to the 



consistency of Bayesian posterior distributions f lGhosh and Ramamoorthill2003l . Sec. 1.3). 
Here I shall say that the plausibility function is consistent at 9* if pl„(^) — )■ with Pg*- 
probability 1 for any subset A of the complement of an open neighborhood of 9*. Such 

7 



a property implies that the chance of making an incorrect decision with the plausibihty 
function-based procedures in Section [2^21 is asymptotically small. 

Consider first a simple case. As before, let 9* denote the "true" value of the parameter, 
and consider a singleton assertion A = {9}. Since Ty^ < Ln{9) / Ln{9*) and -F„,e(-) is non- 
decreasing, if ^ 7^ ^*, then 

pl„(^) < F„,,(L„(e)/L„(r)) = F„„,(e-"^"(^*'^)), 

where Kn,{9*,9) = n~^ log{Ln{9*) / Ln{9)} is the empirical Kullback-Leibler divergence. 
By the law of large numbers, with Pe*-probability 1, there exists N such that Kn{9*, 9) > 
for all n > N. Therefore, pl„(6') — )■ with Pg* -probability 1 for all 9 ^ 9*. That is, the 
plausibility function will correctly distinguish between the true 9* and any 9^9* with 
probability 1 for large n. Compare this to a simple well-known result for the likelihood 



function, e.g., or lLehmann and Casellal (jl998l . Theorem 6.3.2). 



Now consider an assertion A, which is a subset of the complement of an open neigh- 
borhood of 9*. It is shown in (IT9l) that, for any rj > 0, 



P,.{pl„(A) >r/} < P,Jsup^^ > iniFnM]. {H] 



From (iTTj) it is clear that proving consistency requires two things: controlling tail proba- 
bilities of supgg^Tyg and keeping F'lirf) sufficiently far away fro m zero for fixed r?. Th e 
first hurdle can be overcome using empirical process theory (e.g.. lWong and Shenlll995l ). 



The second hurdle can be handled with some additional model assumptions, though I 
suspect the result holds more generally; see Figure [1] and the relevant discussion. 

Suppose (Fi, . . . , Yn) G Y" are iid P^, and that Pq admits a density pe with respect to 
a common cr-finite measure v on Y; write !^ = {pg : 9 E Q}. For any two non- negative 
i^-integrable functions {f,g) on Y, let H{f,g) = J{f^^'^ — g^^'^Y du denote the Hellinger 
distance between / and g. For pe.Ve' € ^ , write h{9, 9') = H{pg,pg'); ii 9 \-^ pg is one-to- 
one, then h{9, 9') defines a metric on 6. Assume that, at least in a small neighborhood 
of the true 9*, there is a constant C such that h{9*,9) < C\\9 — 9*\\, where || ■ || is the 
natural metric on 6. A Hellinger e-bracket is a. set {p E J3^ : i < p < u}, where {i,u) 
are two non-negative i/-integrable functions with H{i,u) < e. For a subset ^o ^ ^i 
define A''[](e, J^q,H) to be the smallest number of Hellinger e-brackets needed to cover 
^0- Finally, let < and > denote inequality up to a universal constant. 

Theorem 4. Let 9* denote the "true" parameter value, and take any rj e (0, 1). In addi- 
tion to the assumptions in the preceding paragraph, suppose there exists positive constants 
a > 0, b = b{ri) > such that, for any e > and for all large n. 



where ^q = ^ fl {pg : H{pg*,pg) < 2e}, and 



f \hgN[]ix/a,^o,H)y^'dx<V^e', (12) 

Je2/28 



inf F~l{r]) > e~'"''\ (13) 



Then the plausibility function pl„(-) is consistent at 





fa) Poisson 



(b) gamma-cxponcntial mixture 



Figure 1: Plots of distribution functions Fq for a range of 9 values for two distributions 
with n = 25 (in gray). Diagonal line is the uniform distribution function. 



Theorem H] implies that, as n — >■ oo, the Type II error probability of the test vanishes 
and, similarly, the plausibility region is shrinking to the singleton {9*}. In fact, Theorem|4] 
actually gives more than just consistency — there is some notion of a rate at which pl„(-) 
collapses to a point mass at 9* . For example, if (e„) is a vanishing sequence such that 
the conditions flT2l) and flT3|) hold with £:„ in place of e, then the same proof shows that 



pl„({6' : \\9 — 9*\\ > £„,}) — 7- in Pg* -probability, provided that ne^ — )■ oo. The latter 



condition can be relaxed to net 
Bayesian results in, e.g. 



Ghosal et al. 



O p(1) w ith some extra effort. Compare this to the 
( 20001 ). Therefore, the plausibility regions shrink 
with n at a rate roughly n"^/^. 

The two main conditions, flT^ and flT^ . of Theorem H] are rather abstract, so some 
comments are in order. First, for parametric families <^, which is the focus here, the 
Hellinger bracketing metric entropy logA^[](x/a, ^ fl {pe '■ H{pg*,pg) < 2e},H) is typi- 
cally of the order logfae/x). So co ndition (fT2|) holds with e of the order n~^/^. See Re- 
mark (ii) in lWong and ShenI (119951 . p. 351). Second, unlike uniform control of likelihood 



ratios, which is ubiquitous in statistical theory, bounding the quantile function F~l seems 
unique to the present approach. Theorem [3] guarantees that, for group-transformation 
models, Fn^ is free of 9 and positive on (0, 1); thus, flT3l) holds trivially. Outside group- 
transformation models, if — 21ogTye is asymptotically chi-square, then the fact that Fn^g 
is converging pointwise to a smooth distribution function supported on (0, 1) provides 
some comfort, though flT^ should hold more generally. I contend that Ty^ is stochasti- 
cally larger than Unif(0, 1), uniformly in 9, in which case, (TT3|) holds trivially. Figure [1] 
shows plots of Fg for a range of 9 values for two different distributions — Poisson and Lind- 
ley's gamma-exponential mixture (Example Hj). The stochastic dominance seems clear, 
but a general proof of this claim escapes me. In any case, one can always check (ITS]) 
heuristically by drawing a picture like in Figure [TJ 



2.5 Examples 

Example 1 . Inference on the success probability 6 bas ed on a samp l e Y from a binomial 
distribution is a fundamental problem in statistics. iBrown et al.l ( 1200 ll . 120021 ) showed 



that the widely-used Wald confidence interval 9 ± z* ,2(^(1 ~ ^)/'^}^^^) where 9 = Y/n is 
the maximum likelihood estimate, often suffers from strikingly poor frequentist coverage 
properties, and that the Wilson, Agresti-CouU, and Jeffreys' prior Bayes intervals are all 
better. In this problem, to construct a plausibility interval for 9, write 

^,Q,[Pe{Y<y) + Pe{Y>2n9-y) ii y < n9 
'^^^ yPeiY >y) + Pe{Y <2n9-y) iiy>n9. 

Then the endpoints of the plausibility interval can be found by solving the equation 
ply(^) = a with bisection. It follows from the general theory that this plausibility interval 
has guaranteed coverage probability \ — a. However, the discreteness of the problem 
implies that the coverage is conservative. 

Example 2. Consider a binary response variable Y that depends on a set of covari- 
ates X = {l,xi, . . . ,Xp)~^ G MP~^^. An important special case is the probit regression 
model, with likelihood function given by Ly{9) = YYi=i ^ixj9y^[l — ^{xj9)Y~'^\ where 
yi, . . . ,yn are the observed binary response variables, xt = (l,Xji, . . . ,Xip)~^ is a vector 
of covariates associated with yi, $ is the standard Gaussian distribution function, and 
9 = {9q, 9i, . . . , 9p)~^ 6 6 is an unknown vector of regre ssion coefficients. This likelihood 



function can be maximized, e.g., via the EM algorithm (JDempster et al.lll977l ). to obtain 
the maximum likelihood estimate 9 and, hence, the relative likelihood Tyg in ([T]). Then 
the plausibility function p\y{9) can be evaluated as in (jTj). 

For illustration, I consider a real data s et wi th a single covariate {p = 1). The 



data, presented in Table 8.4 in iGhosh et al.l (120061 . p. 252), concerning the relationship 



between exposure to choleric acid and the death of mice. In particular, the covariate x is 
the acid dosage and y = 1 ii the exposed mice dies and y = otherwise. Here a total of 
n = 120 mice are exposed, ten at each of the 12 dosage levels. Figure [2] shows the 95% 
plausibility region for 9 = {9o,9i)~^ . The center point is the maximum likelihood estimate 
9 = (—2.13,8.73)"''. For comparison, the 95% confidence region based on the asymptotic 
normality of 9 is also given. In this case, the plausibility and confidence regions are almost 
indistinguishable, likely because n is relatively large. The 0.95 coverage probability of 
the plausibility region is, however, guaranteed and its similarity to the classical region 
suggests that it is also approximately efficient. 

Example 3. Let Y = {(Yii,Yi2) : i = 1, . . . ,n} he a. sample from a standard bivariate 
Gaussian distribution with correlation coefficient 9; the more general case is considered 
in Example [71 Figure [3] shows a plot of the plausibihty function p\y{9), where y is the 
observed sample of size n = 25 from the standard bivariate Gaussian distribution with 
correlation 9* = 0.1. The peak is at the maximum likelihood estimate 9 = —0.035, and 
the a = 0.05 level set gives the 95% plausibility interval for 9. The dashes on the 9- 
axis mark the 95% confidence interval based on Fisher's exact distribution of the sample 
correlation. In this case, both intervals have exact coverage, but the plausibility interval 
is considerably shorter. 
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Figure 2: Solid line gives the 95% plausibility region for 6 = {9o, 9i)~^ in the probit 
regression problem in Example HI The center point denotes the maximum likelihood 
estimate 6 = (—2.13,8.73)''^. Dashed line gives the 95% confidence region based on the 
asymptotic normality of the maximum likelihood estimate. 
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Figure 3: Plausibility function in Example 121 Asterisk marks 6* = 0.1; vertical line marks 
6 = —0.035; horizontal line at a = 0.05 determines the cutoffs for the 95% plausibility 
interval. Dashes on the ^-axis mark endpoints of Fisher's exact interval. 
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Example 4. Let Yi, . . . ,Yn be independent samples from a distribution with density 
Pdiy) = ^^(^ + l)~^(y + l)e"^^, for y,9 > 0. This non- standard distr ibution, a mix- 



ture of a gamma and an exponential density, appears in iLindleyi ( 119581 ) . In this case 



9 = {1 — y + (y^ + 6y + l)^/^}/2y, and the relative likelihood is 

Ty^, = [6 /Of ''{{6 + l)l{e + i)}"e"^(^~-^). 

For illustration, I compare the frequentist coverage probability of the 95% plausibility 
interval versus that of two likelihood-based methods: standard asymptotic normality of 
9 and the parametric bootstrap (with B = 5000 bootstrap samples). With 1000 random 
samples of size n = 50 from the distribution above, with 6* = 1, the estimated coverage 
probabilities are 0.949, 0.911, and 0.942 for the plausibility, asymptotic normality, and 
bootstrap methods, respectively. Thus, the plausibility interval hits the desired coverage 
probability on the nose, while the other two intervals fall short, especially the one based 
on asymptotic normality. This example also demonstrates the efficiency of the plausibility 
function approach outside the group transformation model context. 

3 Marginal plausibility functions 

3.1 Construction 

In many cases, 6 can be partitioned as 6' = {ip, A) G \1/ x A where ip is the parameter of 
interest and A is a nuisance parameter. For example, ip could be just a component of the 
parameter vector 6 or, more generally, ip is some function of 6. In such a case, the approach 
described above can be applied with special kinds of sets, e.g., A = {9 = {ip, X) : X E A}, 
to obtain marginal inference. However, it may be easier to interpret a redefined marginal 
plausibility function. The natural extension to the methodology presented in Section |2] is 
to consider some function Ty^^ of data y and interest parameter ip. For example, replace 
the relative likelihood ([T]) with the relative profile likelihood 

Ty^^ = Ly{ij,X^)/Ly{ij,X^), (M) 

where A^ is the conditional maximum likelihood estimate of A when ip is treated as fixed, 
and ip = argmax^ Ly{ip, X^). As before, other choices of Ty^^ are possible, but (IT^ is an 
obvious choice and shall be my focus in what follows. 

If Ty,^p is A-ancillary, i.e., the distribution function F^ of Ty^^, when Y ~ P^, is free 
of the nuisance parameter A, then the development in the previous section carries over 
without a hitch. That is, one can carry out the same Monte Carlo approximation to F^ 
to construct a marginal plausibility function 

mp\y{A) = snpF4Ty^^), AC^, (15) 

which in turn can be used exactly as in Section 12.21 for inference on the parameter ip of 
interest. The distribution function F^ can be approximated via Monte Carlo just as in 
Section 12. 3j see f lT6|) below. Unfortunately, checking that F^ is free of A is a difficult 
charge in general. This issue is discussed further in the next subsection. 
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3.2 Theoretical considerations 



It is straightforward to verify that the fixed-ra samphng distribution properties (Theo- 
renis[TH2]) of the plausibihty function carry over exactly in this more general case, provided 
that Ty^^ in flT^ is A-ancillary. Consequently, basic properties of the plausibility regions 
and tests (Corollaries [I]-[2]) also hold in this case. It is rare, however, that Ty,^ can be 
written in closed-form, so general conditions on the model are needed to verify this. 

Following the ideas in Corollary |3l it is natural to consider models having a special 
structure. The particular structure of interest here is that where, for each fixed ip, 
P^^x is a group transf ormation model w i th re spect to A. This is called a composite 
transformation model; iBarndorff-NielsenI (119881 ) gives several examples, and Example [7] 
below gives another. For such models, the proof of Theorem [3] above shows that the 
relative profile likelihood Ty^ip is A-ancillary. Therefore, frequentist inference based on 
the marginal plausibility function mpl is exact in composite transformation models. 

What if the problem is not a composite transformation model? In some cases, it is 
possible to show directly that Ty,^ is A-ancillary (see Example [H] below) but, in general, 
this seems to be a difficult task. Large-sample theory can, however, provide some guid- 
an ce. Consider, for example, the asym ptotic expansions of the profile likelihood presented 



m 



Murphv and van der VaartI (J2000l ) fo r the iid case ; deta ils specific to the parametric 
problem of interest here can be found in lBickel et al.l (119981 Chap. 2). In particular, it is 
shown that, under certain regularity conditions, — 2 log Ty^^ is asymptotically chi-square 
when Y = (Yi, . . . , ¥„) are iid P^^a — irrespective of A. The fact that this limiting result 
free of A suggests that, at least for large n, the nuisance parameter has only a "lower- 
order" effect on the sampling distribution of Ty^^p, thus approximate A-ancillarity. This, 
in turn, suggests the following intuition: since A has only a minimal effect, construct a 
marginal plausibility function for ip, by fixing A to be at some convenient value Aq. That 
is, a Monte Carlo approximation of F^ can be obtained as 



F^{T, 



y,tp) 



1 ^' 



y(i) ___ y(A/) 



iid 



^M- 



(16) 



m=l 



One can go a step further and consider different choices of Ty^^ that might be less sen- 
sitive to the choice of A. For example, there are quantities, common in higher-order in- 
ference literature — the Bartlett correction to the likelihood ratio or the signed likelihood 
root — which are known to have faster convergence to a limiti ng distribution, suggest- 



ing that there is actually less dependence on the choice of A (IB arndorff- Nielsen] 11986 



Barndorff-Nielsen and Halllll988l : ISkovgaardI l200ll ) . Such quantities have also been used 
in conjunction with boots trap /Monte Ca r lo sch emes that avoids use of the approximate 
limiting distribution; see iDiCiccio et al.l (l200l[ ) and iLee and Youngl ( 120051 ). These ad- 
justments, special cases of the general program proposed here, did not appear to be 
necessary in the examples considered below. However, further work is needed along these 
lines, particularly in the case of high-dimensional A. 



3.3 Examples 

Example 5. Start with a simple illustrative example: Yi,...,y„ independent with dis- 
tribution H{ip,X), where 9 = {ip,X) is completely unknown, but only the mean ip is of 
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interest. A simple calculation shows that the relative profile likelihood is 

Ty,^ = [l + n(Y - i,f / S^'''^ 

where S'^ = X]r=i(^« ~ ^)^ ^^ ^^^ usual residual sum-of-squares. Since the ratio inside the 
brackets is like the squared t-statistic, one can easily see that the marginal plausibility 
interval for ip is exactly the textbook t-interval. Essentially the same calculations show 
that the marginal plausibility interval for a slope (5^ in a Gaussian linear regression 
problem is the same as the usual t-interval. 

Example 6. Suppose that Yi, . . . ,Yn are independent real- valued observations from an 
unknown distribution P, a nonparametric problem. Consider the so-called empirical 
likelihood ratio, given by n" nr=i P({^«})' where P, playing th e role of 6 in the parametric 
problem, ranges over all probability measures on M (see, e.g.. IOwerull988l ). Here interest 
is in a functio nal of P. In partic ular, for a given p G (0, 1), let ip{P) denote the lOOpth 
quantile of P. IWassermanI ( 19901 . Theorem 5) shows that 



T, 



Y,i> 



p 



n 



where 



np li il) = if) 



and ip is the lOOpth sample quantile, the maximum likelihood estimate. The distribution 
of Ty^-^ depends on P only through ip, so the marginal plausibility function is readily 
obtained via Monte Carlo. However, the discreteness will make the inference somewhat 
conservative. Regarding efficiency, I report on a small simulation study. For standard 
normal, exponential, and Student-t (df = 3) distributions, 1000 samples of size n = 100 
are taken and, for each, a 95% plausibility interval for the 75*^^ percentile is calculated. 
The estimated coverage probabilities are 0.957, 0.962, and 0.959, respectively. This sug- 
gests, as expected, that the plausibility intervals are conservative. Similar results are seen 
for different percentiles and for smaller/larger samples sizes. 

Example 7. Consider a bivariate Gaussian distribution, like in Example [3l but this time 
with all five parameters unknown. That is, the unknown parameter is ^ = {ip, A), where 
the correlation coefficient ip is the parameter of interest, and A = ( p,^ , p.^,, a^ , a^) is the 



nuisance parameter. From the calculations in ISun and Wong] ( 120071 ). the relative profile 
likelihood is 

Ty,,={(i-^Y/^(i-^^)^/v(i-v^^)r, 

where ip is the sample correlation coefficient. It is clear from the previous display and 
basic properties of tp that the distribution of Ty^^ is free of A; this fact could also have been 
deduced directly from the problem's composite transformation structure. Therefore, for 
the Monte Carlo approximation in ([7]), data can be simulated from the bivariate Gaussian 
distribution with any convenient choice of A. 

For my first illustration, I replicate a simulation study in lSun and Wong ( 20071 ). Here 
10,000 samples of size n = 10 from a bivariate Gaussian distribution with A = (1, 2, 1, 3) 
and various ip values. Coverage probabilities of the 95% plausibility intervals are displayed 
in Table [TJ For comparison, coverage probabilities for s everal other me thods are also 
given: z is Fisher's classical interval; z^ and /a are due to iHotellind ( 1l953l ): Zhr is due to 
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Correlation, tp 



Method 



-0.9 



-0.5 



0.0 



0.5 



0.9 



z 


0.9527 


0.9525 


0.9500 


0.9517 


0.9542 


Zi 


0.9499 


0.9509 


0.9494 


0.9502 


0.9516 


Zhr 


0.9643 


0.9577 


0.9550 


0.9577 


0.9610 


h 


0.9507 


0.9514 


0.9517 


0.9526 


0.9506 


R* 


0.9488 


0.9500 


0.9517 


0.9508 


0.9492 


PB 


0.9385 


0.9425 


0.9453 


0.9438 


0.9411 


pl 


0.9492 


0.9496 


0.9502 


0.9505 


0.9509 



Table 1: Estimated coverage probabilities of several 95% intervals for il) in the Example [7] 
simulation. First five rows are taken from Table 1 in lSun and Wong ( 120071 ): last two rows 
correspond to parametric bootstrap (PB) and plausibility intervals, respectively. 



95% intervals for ip 
Method Lower Upper Length 



z 


-0.919 


-0.460 


0.459 


Z4 


-0.919 


-0.462 


0.457 


Zhr 


-0.915 


-0.439 


0.476 


h 


-0.912 


-0.447 


0.465 


R* 


-0.913 


-0.449 


0.464 


PB 


-0.924 


-0.488 


0.436 


pl 


-0.918 


-0.461 


0.457 



Table 2: Intervals for estimating the correlation ip for the fat-gain data from iLevine et al. 
(119991 ) in Example [71 First five rows are taken from Table 2 of ISun and WongI (120071 ): 
last two rows show the parametric bootstrap (PB) and plausibility intervals, respectively. 



Rubenl (Il966r): and R* is due tolSun and Wong (120071 ). More details about these methods 



Sun and WongI (120071 ). Results for standard parametric bootstrap intervals 



are given m 

(with 5000 bootstrap samples) are also displayed. In this case, based on the first three 
significant digits, z, z^, f^, and R* perform reasonably well, but the parametric bootstrap 
intervals suffer from drastic under-coverage. The plausibility intervals are apparently the 
most accurate across the range of ip values. 

Second, I revisit the real-data example in ISun and Wong! (120071 ). The data measures 



the i ncrease in energy use yi and the fat gain ?/2 for n = 16 individuals (iLevine et al 



19991 ). Table 121 lists the 95% intervals for ip based on same seven methods in Table [H The 
plausibility interval is also one of the two shortest in this example, among the methods 
with good performance in the simulation study above. 

Example 8. Consider a gamma distribution with mean ip and shape A; that is, the density 
is pe{y) = T{X)~^{X/^)^y^~^e~^y^'^ , where 9 = (V^, A). The goal is to m ake inference on 
the rn e an ip. Likelihood - based solutions to this pr oblem are presented inlGrice arid Bain 
(ll980h . [iYaser and ReidI (ll989h . [iYa"ser et al.l(ll997h : see. also. lKulkarni and Powaii (l2010h . 
In the present context, it is straightforward to evaluate the relative profile hkelihood Ty^i/i 
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95% intervals for tp 




Method 


Lower Upper 


Length 


classical 


96.7 130.9 


34.2 


WonsT ri993) 


97.0 134.7 


37.7 


Eraser et al. ('1997) 


97.2 134.2 


37.0 


plausibility 


97.1 133.6 


36.5 



Table 3: Interval estimates for the gamma mean ip for the radiation data in Example El 

in (IT^ . However, it is apparently difficult to check if the distribution function F^ of Ty^^ 
depends on nuisance shape parameter A. So, following the intuition above, I shall assume 
that it has a negligible effect and fix A = 1 in the Monte Carlo step. That the results 
are robust to fixing A = 1 in large samples is quite reasonable, but what about in small 
samples? In simulations of samples of size n = 10 from a variety of {ip, A) combinations, 
I find that the choice of A has no effect, in that the marginal plausibility intervals for ijj 
(not shown) always have exact coverage, up to Monte Carlo error. 

For illustration, I consider a real- data example concernin g the survival times of n = 20 



rats exposed to 240 rads of radiation (JGross and Clarklll975l ). TableOshows 95% intervals 



for ip based on four different methods: the classical ffist-order a ccurate appro ximation; 



the second-order accurate parameter-averaging ap proximation of IWond (119931 ): the best 



of the two third-order accurate approximations in iFraser et al.l (119971 ): and the marginal 
plausibility interval. In this case, the marginal plausibility interval is shorter than both 
the second- and third-order accurate confidence intervals. 

As a last word here , I men tion briefly that, for the ceramic strength data considered 



m 



Kulkarni and Powarl (120 lOl ). the marginal plausibility interval for ip is (992.3, 1213.2), 



which is noticeably shorter than the intervals they recommend. 

3.4 Two special applications 

3.4.1 Gaussian random effects model 

An important practical example is the Gaussian random effects model. Suppose Yi, . . . ,Yn 
are independently distributed, with Yi ~ N{fii,af), i = l,...,n, where the means 
/xi, . . . , /i„ are unknown, but the variances af,...,a^ are known. The Gaussian random 
effects portion comes from the assumption that the individual means are an independent 
N{X,ip'^) sample, where 6 = [ip, A) is unknown. In such problems, the typical question of 
interest is if all individual means are the same, i.e., ni = ■ ■ ■ = /i^; in this hierarchical 
framework, the hypothesis of equal means corresponds to a degenerate random effects 
distribution, i.e., ip = 0. So the random effects standard deviation ip is the parameter of 
interest, and the overall mean A is a nuisance parameter. 

Using well known properties of Gaussian convolutions, it is possible to recast this 
hierarchical model in a non-hierarchical form. That is, Yi, . . . ,Yn are independent, with 
Yi ~ N{X,af + ip"^), i = l,...,n. Here the conditional maximum likelihood estimate 
of A, given ip, is the weighted average A^ = Y17=i'^iW^i/ Yl^=i'^ii'^)^ where Wi{ip) = 
1 / {(xf + tp"^) , i = 1, . . . ,n. From here it is straightforward to write down the relative profile 
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Figure 4: Plot of the marginal plausibility function for the variance component if) in the 
SAT coaching example in Section I3.4.1I 

likelihood Ty^^ in f lT^ . Moreover, since the model is of the composite transformation form, 
Ty^tp is A-ancillary so any choice of A (e.g., A = 0) will suffice in the Monte Carlo step 
(IT6|l . One can then readily compute plausibility intervals, etc for ip. 



For a concrete examp le, consider the SAT coaching problem presented in lRubinI (jl98ll ) 
and lGelman et al.l (J2004l . Sec. 5.4). Here n = 8 coaching programs are evaluated, and in- 
ference on if) is required. For this we develop interval estimates of ip based on the marginal 
plausibility interval described above, and an equal-tailed quantile-based parametric boot- 
strap interval with B = 2500 bootstrap resamples. Note that the marginal plausibility 
interval has a coverage probability guarantee while the bootstrap interval, at best, is 
asymptotically correct — with n = 8, asymptotic theory may not meaningful. Figure H] 
shows a plot of the marginal plausibility function for tp for the given data, and the three 
95% interval estimates for ip are, respectively, [0,13.40) and (1.8 x 10"*^, 11.53). Since 
the marginal plausibility interval contains zero, one cannot reject the hypothesis that the 
coaching programs have no effect; the bootstrap interval is extremely close to zero, but 
the choice of an equal-tailed interval makes boundary point containment impossible. 

A simulation study will cast additional light on the relative performance of the two 
methods above, in particular, for the important case of ip near the boundary, i.e., ip ^ 0. 
Since the two-sided bootstrap interval cannot catch a parameter exactly on the boundary, 
I shall consider true values of ip getting closer to the boundary as the sample size increases. 
In particular, for various n, I shall take the true ^ = n~^l'^ and compare interval estimates 
based on coverage probability and mean length. Here data are simulated independently. 



according to the model Yi ~ N(0,crj + ?/; 



,n. 



where ^/^ is as above and the 



(Ti, . . . , o"„ are iid samples from an exponential distribution with mean 2. Table H] shows 
the results of 1000 replications of this process for four different sample sizes. Observe 
that the plausibility function method hits the target coverage probability on the nose for 
each n, while the bootstrap method suffers from drastic under- coverage for all n. 



YI 



Plausibility Bootstrap 



n 


Coverage 


Length 


Coverage 


Length 


50 


0.952 


0.267 


0.758 


0.183 


100 


0.946 


0.162 


0.767 


0.138 


250 


0.948 


0.079 


0.795 


0.079 


500 


0.950 


0.041 


0.874 


0.039 



Table 4: Summary of the 95% intervals — estimated coverage probabilities and expected 
lengths — for ip in the random effects model simulations in Section 13.4.11 

3.4.2 Variable selection in Gaussian linear regression 

This final example demonstrates the proposed method's ability to handle non-standard 
inference problems, in this case, variable selection in regression. Here I give a relatively 
simple and straightforward plausibility function-based approach. 

Consider the Gaussian linear model F = /3oln + X[3 + as, where Y = (Yi, . . . , YnY is 
a n- vector of response variables, X is a (fixed) n x p matrix of predictor variables, Po^n 
is a constant n-vector intercept, /3 = (/3i, . . . , Pp)'^ is a p-vector of unknown regression 
coefficients, a > is an unknown scale parameter, and e = (ei, . . . ,£n)'^ is a n-vector 
of standard Gaussian noise. The goal is to select a subset of variables (columns of X) 
that suitably describe the variation in the response Y. Towards this, let 7 G {0, 1}^ 
index the collection of submodels, with /3^ the corresponding subvector of /3. For any 
given 7, evaluation of the marginal plausibility function for /3^ is immediate, since the 
relative profile likelihood for /3^ has distribution free of all other parameters; in fact, it is 
distributed as a simple transformation of a F-distribution, so no Monte Carlo methods 
are required. Now consider the assertion A^ = {/3i_^ = 0}, i.e., that model 7 is not 
wrong. Evidence in the observed y for model 7 can, therefore, be measured by 

with the subscript "[/3i-'y = 0]" emphasizing the fact that the calculation is based on 
the marginal plausibility function for /3i_^, evaluated at zero. As A^ corresponds to a 
singleton assertion with respect to /3i_^, the frequentist calibration in Theorem [2] holds. 
For inference on the number of non-zero coefficients in /3, one can consider 

f^P^yik) = max mpl (A^), (17) 

7;|7|<fc 

the plausibility that the model has at most k non-zero coefficients. It follows from the 
calibration property quoted above that a rule that rejects the claim of at most k non- 
zero coefficients when mpl^(/c) < a with control the frequentist Type I error rate at a. 
Therefore, a reasonable model selection strategy is to choose k* variables, where k* is the 
smallest k such that mp\y{k) < a. 



For a numerical illustration, consider the diabetes data set analyzed in lEfron et al. 



( I2OO4J ). These data consist of observations for n = 442 diabetes patients with p = 10 
covariates: age, sex, body mass index (bmi), average blood pressure (map), and six blood 
serum measurements (tc, Idl, hdl, tch, Itg, and glu). In this case, there are 2^° = 1024 
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Figure 5: Plot of the marginal plausibility function mpl (/c) in (IT7|) versus k for the 
diabetes data regression problem in Section I3.4.2I 



possible models; that mpl^(y4^) can be evaluated in closed-form, without Monte Carlo, 
makes the computations very fast, even for much larger p. A plot of mpl (A;) versus k 
for these data is shown in Figure O Here, with a = 0.05, the method selects k* = 5 
variables. Of all models 7 with five variables, the one with largest mplj^(A^) consists of 
sex, bmi, map, hd l, and Itg. For com parison, these five variables are a subset of the seven 
select ed by lasso (lEfrori et al.l 12004 ) and a superset of the four selected by the Bayesian 
lasso (jPark and Casellall2008l ). 



4 Additional remarks 



Remark 1. It was pointed out in Section [2.11 that the relative likelihood ([T]) is not the 
only possible choice for Tyg. For example, if the likelihood is unbounded, then one might 
consider Ty^g = Ly{9). A penalized version of the likelihood might also be appropri- 
ate in some cases, i.e., Ty^ = Ly{6)7r{6,y), where vr is something like a Bayesian prior 
(although could depend on y too). This could be potentially useful in high- dimensional 
problems. Anothe r inte r esting class of Tyg quantities are those motivated by higher-order 
asymptotics, as in iReidI ( 20031 ) and the references therein. Choosing Tyg to be Barndorff- 
Nielsen's r* = r*{6,y) quantity, or some variation thereof, could potentially give better 
results, particularly in the marginal inference problem involving 6 = {ip,X). However, 
the possible gain in efficiency comes at the cost of additional analytical computations, 
and, based on my empirical results, it is unclear if these refinements would lead to any 
noticeable improvements. 

Remark 2. I have motivated ply(A) as a mea sure of th e evid ence in y in favor of the 
claim "6' G A." It does satisfy the conditions in lSchervishI (119961 ) for measures of support. 
Bayesian posterior probabilities are probably the most common alternative measures of 
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evidence. The fundamental difference between plj/(A) and a Bayesian posterior probability 
is that the former is not a probability measure at all. Although < pL(A) < 1 for all A, 
it can be shown that p\y{A) + p\y{A'^) > 1. While this property is strange and unfamiliar, 
it does have some intuition. That is, it is possible that after seeing y, the mutually 
exclusive claims "6* G A" and "6* ^ A" are both highly plausible, e.g., if sample size is too 
small to clearly distinguish which region 6 lives in. But posterior probabilities necessarily 
satisfy the complementation rule, so if one claim has high posterior probability, then 
its complement cannot. An important but extreme case is that of a singleton assertion 
"^ = ^0-" A (typical) Bayesian posterior assigns zero posterior probability to this claim, 
no matter how strongly the data points to it. 

Remark 3. There has been considerable efforts to construct a framework of post-data 
probabilistic inference without prio rs; these include Fisher's fiducial inference, general- 



ized fiducial inference ( lHannigll2009l ). and the Dempster-Shafer theory of belief functions. 



Although pl„(-) is not a probability measure on 2 , it does have a prior-free posterior prob 



abilistic interpretation via the random sets Qy{U) from ([3]) with uniform U. Moreover, 
the frequentist results presented herein imply that p\y{A), as a measure of evidence in 
support of the claim "^ G A" i s properly calibrated an d, therefore, also meaningful across 
users and/or experiments. See Martin and Liul ( 20121 ) for more along these lines. 



Remark 4. For the kinds of problems considered here, the parametric bootstrap is proba- 
bly the most widely used numerically-driven frequentist inference methods. The simplest 
version develops tests or confidence regions by specifying a statistic, and then approxi- 
mating the sampling distribution of that statistic by an empirical distribution obtained 
by repeated sampling from P^. The trouble, of course, is that if 6 is far from 6, then the 
distribution used in the Monte Carlo step is wrong and the results can be misleading. 
There are other known issues with bootstrap, such as the boundary effect observed in Sec- 
tion 13.4. H which the proposed method seems to avoid. The point is that bootstrap gives 
only asymptotically approximate inference, while the plausibility function-based method 
which gives exact inference in general. On the other hand, the parametric bootstrap, 
and mo difications thereof, perform surprisingly w ell in some problems, even with small 
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Lee and Youngll2005l ). 



n [e.g. 

Remark 5. Using Monte Carlo approximations ([7]) and (IT6|) to construct exact frequentist 
inferential procedures is, to my knowledge, new. Despite its novelty, the method is 
surprisingly simple and general. On the other hand, there is a computational price to 
pay for this simplicity and generality. Specifically, determination of plausibility intervals 
requires evaluation of the Monte Carlo estimate of FoiTyo) for several 9 values. This can 
be potentially time-consuming, but running the Monte Carlo simulations for different 
9 in parallel can help reduce this cost. The proposed method works — in theory and 
in principle — in high- dimensional problems, but there the computational cost is further 
exaggerated. An important question is if some special techniques can be developed for 
problems where only the nuisance parameter is high- or infinite-dimensional. Clever 
marginalization can reduce the dimension to something manageable within the proposed 
framework, making exact inference in semiparametric problems possible. 
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5 Discussion 

In this paper, I have a developed a framework for exact frequentist inference driven by a 
class of plausibility functions, relying critically on numerical methods, specifically Monte 
Carlo, for implementation. Finite- and large-sample theory of the plausibility functions 
has been presented, with implications on the performance of the corresponding inference 
procedures, and a number of examples were given for illustration and comparison with 
existing methods. The computational cost of the proposed method is higher than that of 
other (analytical or numerical) approximate inferential methods, but I would argue that 
exactness is worth the price. The Bayesian must face a similar choice: use a simple prior 
with a closed-form solution, or a more realistic prior with more expensive computations. 
With the available computational resources, the Bayesian tends to opt for the latter. 
And, of course, computational resources can only increase in years to come. 

Finally, it is a common belief that inferential methods based on higher-order ap- 
proximations are seldom used in practice because of the difficult analytical calcula- 
tions in volved. Some progress has been made to rectify this with the hoa package in 



R (e.g., iBrazzale and DavisonI 120081 ). For one-dimensional problems, it would be rela- 
tively straightforward to package a default program to implement the proposed method. 
Users could simply input the data, the likelihood function, and a function to perform the 
sampling in ([7]) and let the program to the rest. But despite the popularity of black-box 
software packages, I would argue that this is not the appropriate direction to take. It 
is simply unrealistic to expect that exact inferential methods can be obtained efficiently 
and generically. I have made my R codes available (www.math.uic.edu/~rgmartin) to 
give interested readers hints for how they might apply the method to their own problems. 
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A Proofs of theorems 

Proof of TheoremUl Take any a G (0, 1) and any 9 & A. Then, by definition of ply(A) 
and monotonicity of the probability measure Pg, I get 



Pe{?\Y{A) <a] = Pejsupee^ Fe(Ty,e) < a} < Pe{Fe{TY,e) < «}• 



(1^ 



The random variable Ty^, as a function F ~ P51, may be continuous or not. In the contin- 
uous case, Fq is a smooth distribution function and FeiTyfi) is uniformly distributed. In 
the discontinuous case, Fq has jump discontinuities, but it is well-known that FoiTyfi) is 
stochastically larger than uniform. In either case, the latter term in f lT5]l can be bounded 
above by a. Taking supremum over 9 E A throughout flTSj) gives the result in ([H]). □ 
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Proof of Theorem\^ In this case, ply (6*) = Fg{TYfi) so no optimization is required com- 
pared to the general A case. Therefore, the "<" between the second and third terms in 
flTSj) becomes an "=." The rest of the proof goes exactly like that of Theorem [H D 



Proof of Theorem\^ For Y ~ Pg, pick any fixed 9q and choose corresponding g^g such 
that 9 = gOo] such a choice is possible by transitivity. Let Yq ~ Pg^^, so that gYo also has 
distribution Pg. Since TgYo^gOo = Tyq^o by the "bi-invariance" assumption fITOl) . it follows 
that Tyg eg has distribution free of 6, proving the claim. D 

Proof of Theorem^ First, take any r] G (0, 1) and any A G Q. Then 

Pg*{pUA) >r]} = Pg J Slip Fg{Ty,g) > 1]] 

< Pg* I sup Fg ( sup Ty^g ) > r]\ 

'-e&A ^e&A ' ^ 
= P9JsupTy.e>infF,-i(r/)| 

where the last inequality follows because Ln{0*) < Ln{9). This proves (TTT]) . Outer 
probabilities may be needed on the right-hand side if measurability is not guaranteed. 

Now take any assertion A ^ 9* . Then there exists an open || ■ ||-ball B centered at 9* 
such that A C B^. Since the Hellinger distance h is bounded, up to a constant, by || ■ ||, 
there exists e > such that A^ = {9 : h{9\ 9) > e} D A. From f lT9|) and f lT3|) . 

Ln{9) 



Pg^plM) >v}< P.4sup -^ > inf Fjiv)} 






By assumption flT^ and Theorem 1 of IWong and ShenI ( 19951 ). there is a constant c > 



such that the upper bound itself is bounded by 4e~'^'^^ . It follows that, for any tj e (0, 1), 
the upper bound (l20l) for Pe*{pl„(y4) > 77} is summable over n > 1. An application of 
the Borel-Cantelli lemma gives that pl„(^) — >■ with P6i*-probability 1. Since A^ 9* is 
arbitrary, the plausibility function is consistent. D 
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